1. INTRODUCTION

The 21st century has witnessed spectacular progress in artificial intelligence (AI) [1-4]. Instant messaging
systems have been widely adopted as one of the primary communication technologies [5, 6]. Nowadays, chatbots,
an application of AI, are replacing humans in interacting with users of these systems [7-11]. They have
demonstrated their value in improving the overall user experience in contexts such as
customer service [12], education [13] and online business [14]. They are opening up a new way for
companies to connect with the world and, most critically, with their clients by increasing the reach
of messaging applications. Humans cannot reply to messages 24 hours a day, 7 days a week,
because of the limits of human strength and agility.
Therefore, chatbots emerge as a suitable replacement for humans [15, 16]. They have been playing a crucial
role in many systems that require high stability and fast response [17], as most companies understand
the importance of customers' experience: even a small problem can make customers lose
interest in a company's services. An unpleasant experience could lead to losing a customer to the
company's competitors [18-21].

Figure 1 presents the message response times of two different management pages offering the same kind
of service, homestay (some information has been blurred for privacy reasons). The first one
typically replies instantly, while the second one usually replies within a few hours. Assume that these two
services have the same quality, price, reputation, and opening and closing times, and are located near each other.
What most customers value in customer service is saving their time. They tend to use services that
answer their questions quickly and accurately, often putting the price of services and products
second [22, 23]. Therefore, most customers would choose the first service.

Figure 1. Message response speeds of two homestay services' fanpages, (a) The page replying
instantly, (b) The page replying within a few hours


However, voice is the most natural and effective means of communication [24]. Therefore,
voicebots have been introduced. Typically, a voicebot is a combination of a chatbot, a text-to-speech (TTS)
module and a speech-to-text (STT) module. After receiving the customer's voice message, the bot passes it
to the STT module, which analyzes it and converts it into text. The bot then generates a response message,
converts it into an audio file using the TTS module, and plays it in the user interface.
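
The flow described above can be read as a composition of three steps. The following is a minimal sketch under stated assumptions: the function names speechToText, chatbotReply and textToSpeech are hypothetical stand-ins, not part of FPT.AI or of the developed application, and the remote calls are replaced by local stubs so that the example runs offline.

```javascript
// Hypothetical stand-ins for the three components; none of these names come
// from the paper or from any real FPT.AI client library.
function speechToText(audio) {
  // A real STT module would call a remote service. Here the "audio" object
  // simply carries its own transcript so the sketch can run offline.
  return audio.transcript;
}

function chatbotReply(text) {
  // Placeholder dialogue logic (assumption, for illustration only).
  return text.trim() === "" ? "Please repeat your question." : "You asked: " + text;
}

function textToSpeech(text) {
  // A real TTS module would return an audio file; we return a descriptor.
  return { format: "mp3", spokenText: text };
}

// The voicebot is the composition STT -> chatbot -> TTS.
function voicebot(audio) {
  const question = speechToText(audio);
  const answer = chatbotReply(question);
  return textToSpeech(answer);
}

const spoken = voicebot({ transcript: "Is a room available?" });
console.log(spoken.spokenText);
```

Keeping the three steps behind separate functions means that either conversion module could be swapped without touching the dialogue logic.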

Our objective is to create a professional voicebot that can run as a web application in users' Internet
browsers. It will be convenient for users as it does not require any installation on their computers. The user
interface should be as friendly and simple as possible to minimize loading time and improve user experience.
The remainder of this work is organized as follows: In section 2, we present the research methodology.
In section 3, results and analysis are discussed. Finally, section 4 draws our conclusions about the
developed FPT.AI-based voicebot.


2. PROPOSED METHOD 
2.1. Development 
2.1.1. Front-end and back-end development 

In this work, the front-end languages used to create the graphical user interface (GUI) are hypertext
markup language (HTML), cascading style sheets (CSS), and JavaScript (JS). The Bootstrap framework
and the jQuery library were also utilized because of their customizability, speed of development and ease of use.
For back-end development, PHP is used. In addition, the WebAudioRecorder JS library [25] is used to record
voice commands and encode them to MP3 files before any analysis is performed.


2.1.2. FPT.AI's utilized modules

FPT.AI's modules [26] were chosen because they are free and support PHP. Besides, contacting
support in Vietnam (where the population is approaching 100 million) is easier and faster than through
other commercial tools supported by Google or Microsoft, since their headquarters are overseas.
In addition, the provided tool [26] has advantages in supporting the local language, Vietnamese, over
similar products offered by Google or Microsoft.


2.2. Dataset
2.2.1. Module testing 

For TTS module testing, random texts from the CNN and VnExpress news pages were used. Ten (10)
text blocks, comprising short texts (words or phrases), medium-length texts (sentences) and long
texts (paragraphs), were chosen for testing. For STT module testing, 15-second-long files from the FPT open speech
dataset (FOS) [27, 28] and the LibriSpeech dataset (LS) [29], together with two music files, were chosen. The audio files
available in FOS are already in MP3 format, while audio sources from LS must be converted to MP3 before
testing, as the modules do not support non-MP3 audio files.


2.2.2. Voicebot testing 
Messages about the homestay service were sent to check whether the voicebot understands them.
It was also tested with off-topic messages to see how it replies.


2.3. Voice recorder’s configuration 

Figure 2 presents the configuration of the voice recorder integrated into the developed voicebot.
Here, encoding defines the type of the output file after the recorded data are encoded. workerDir gives
the directory containing the script WebAudioRecorder.js. numChannels is the number of channels
of the encoded file; it is set to 2, as the MP3 encoding in this setup uses two channels. timeLimit, the time limit
for recording, is set to 15 seconds, as voice commands need to be short and easy to understand.
encodeAfterRecord selects the encoding method. We set it to true because encoding after recording is safer
than encoding while recording. bitRate, the number of kilobits per second encoded in the MP3 file, is set
to 160 kbps, keeping the recording lightweight yet acceptable for the module to process.


recorder = new WebAudioRecorder(input, {
    workerDir: "js/",
    encoding: "mp3",
    numChannels: 2
});

recorder.setOptions({
    timeLimit: 15,
    encodeAfterRecord: true,
    mp3: {bitRate: 160}
});

Figure 2. Voice recorder’s configuration 
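
To illustrate what "lightweight" means for these settings, the short sketch below estimates the size of one recording from the bitRate value of 160 and the 15-second limit shown in Figure 2. It assumes constant-bitrate encoding and ignores any file headers or padding, so the result is only an approximate upper bound for a full-length recording; the helper name cbrSizeBytes is introduced here for illustration.

```javascript
// Size of a constant-bitrate recording: bits per second * seconds / 8.
// Assumption: headers and padding are ignored, so this is only an estimate.
function cbrSizeBytes(bitRateKbps, seconds) {
  return (bitRateKbps * 1000 * seconds) / 8;
}

const sizeBytes = cbrSizeBytes(160, 15);
console.log(sizeBytes);        // 300000 bytes
console.log(sizeBytes / 1000); // 300 kB
```

Under these assumptions, each voice command uploaded to the STT module is at most about 300 kB.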


2.4. Testing process 
2.4.1. Module testing 

For module testing, a PHP file is created to receive text or audio sources and make requests
to the server through its application programming interface (API). Log files are then written. The TTS
module testing process is presented as a flowchart in Figure 3(a). First, the tester enters the test text into
a text field and clicks the Submit button. StartTime is written. The application then checks whether it can
connect to the server. If not, it displays an error message and ends the testing process. Otherwise, it
sends the input text to the server and waits for the server's TTS conversion. After that, it receives
the processed audio file and plays it in the user interface. Then the resulting EndTime and Processing Time
are written to the log file. Finally, the test process ends.

The STT module testing process is presented as a flowchart in Figure 3(b). First, the audio source
information is sent to the application. StartTime is written. The application then checks whether it
can connect to the server. If not, it displays an error message and ends the testing process. Otherwise, it
sends the audio source to the server and waits for the server's processing. After that, it receives the text converted
from the audio source and shows it in the user interface. Then the results, EndTime and Processing Time,
are written to the log file. Finally, the test process ends.
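
The timing logic shared by both flowcharts can be sketched as follows. This is an illustration only: the paper's test harness is written in PHP, whereas the sketch uses JavaScript, and the names timedTrial, convert and fakeTts are hypothetical. The converter is a local stub, and the sketch also records a log entry when the conversion fails, which is an assumption beyond the described flow.

```javascript
// Hypothetical helper: runs one conversion request and returns a log record.
// `convert` stands in for the call to the remote TTS or STT service, and
// `now` is injectable so the timing can be checked deterministically.
function timedTrial(label, input, convert, now = Date.now) {
  const startTime = now();
  let result;
  try {
    result = { ok: true, output: convert(input) };
  } catch (err) {
    // Corresponds to the "cannot connect" branch: keep the error message.
    result = { ok: false, error: err.message };
  }
  const endTime = now();
  return { label, startTime, endTime, processingMs: endTime - startTime, ...result };
}

// Stub converter so the sketch runs without a network connection.
const fakeTts = (text) => ({ format: "mp3", spokenText: text });

const record = timedTrial("TTS", "Infected people.", fakeTts);
console.log(JSON.stringify(record));
```

Passing the clock as a parameter is a design choice that keeps the measurement logic testable without waiting on a real server.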

In Table 1, the testing data for each module are presented. Five blocks of Vietnamese text quoted from
VnExpress [30, 31] and five blocks of English text quoted from CNN [32, 33] were submitted to generate an
audio file for each block. In addition, ten audio files containing Vietnamese speech and ten
audio files containing English speech were used to test the STT module. To challenge the module further,
two songs, one Vietnamese and one American, were tested. The English music file,
a 15-second cut of the song Boulevard of Broken Dreams performed by
Green Day, is named English_Music.mp3. The Vietnamese one, also a 15-second cut, is named
Vietnamese_Music.mp3. It is a cut of the song Dù có cách xa performed by Dinh Manh
Ninh. All of the testing audio files have durations of 15 seconds to match the voice recorder's configuration
shown earlier.

Figure 3. Modules testing process flowchart, (a) TTS module, (b) STT module

Table 1. Module testing data
Module  Data
TTS     Bệnh nhân. (2 words)
        Cùng với giảm giá. (4 words)
        Phong tỏa bệnh viện Lao Phổi Quảng Ninh. (8 words)
        Sở Y tế Quảng Ninh sáng 15/3 cho biết đã cách ly toàn bộ nhân
        viên y tế và 113 bệnh nhân nội trú Bệnh viện Lao và Phổi, do bố
        mẹ "bệnh nhân 52" làm việc tại đây. (39 words)
        Phương án điều hành giá lần này, theo nhà chức trách, đảm bảo
        giá bán xăng dầu trong nước phản ánh xu hướng giá thành phẩm
        thế giới, phù hợp với tình hình kinh tế xã hội. Mặt khác, mức trích
        lập Quỹ Bình ổn giá xăng dầu ở mức hợp lý để có dư địa điều
        hành giá xăng dầu trong những kỳ tiếp theo trong bối cảnh giá
        xăng dầu đang có diễn biến khó lường, giảm giá ở tất cả mặt hàng.
        (85 words)
        Infected people. (2 words)
        New studies in several countries. (5 words)
        US Secretary of Health and Human Services. (7 words)
        These officials have emphasized that the virus is spread mainly by
        people who are already showing symptoms, such as fever, cough
        or difficulty breathing. If that's true, it's good news, since people
        who are obviously ill can be identified and isolated. (41 words)
        Albert II and Charlene have two young children -- twins Princess
        Gabriella and Prince Jacques. Gabriella was born two minutes
        earlier than Jacques, but he will one day take over the monarchy.
        As a man, he will take precedence over his older sister. Albert II
        has two older children, who were born out of wedlock and as such
        are not considered royal. The royal house received $54.4 million
        in 2020, according to the principality's official budget. Of that,
        $14.6 million goes to the Prince. (84 words)
STT     FPTOpenSpeechData_Set001_V0.1_002331.mp3
        FPTOpenSpeechData_Set001_V0.1_006070.mp3
        FPTOpenSpeechData_Set001_V0.1_008636.mp3
        FPTOpenSpeechData_Set001_V0.1_011371.mp3
        FPTOpenSpeechData_Set001_V0.1_012565.mp3
        FPTOpenSpeechData_Set001_V0.1_013642.mp3
        FPTOpenSpeechData_Set002_V0.1_000201.mp3
        FPTOpenSpeechData_Set002_V0.1_004204.mp3
        FPTOpenSpeechData_Set002_V0.1_004461.mp3
        FPTOpenSpeechData_Set002_V0.1_007016.mp3
        672-122797-0001.mp3
        672-122797-0042.mp3
        672-122797-0064.mp3
        1089-134686-0018.mp3
        1089-134691-0012.mp3
        1188-133604-0011.mp3
        1188-133604-0024.mp3
        1221-135766-0010.mp3
        1221-135767-0016.mp3
        1284-1180-0019.mp3
        English_Music.mp3
        Vietnamese_Music.mp3


2.4.2. Voicebot testing 

Table 2 shows the message templates used to test whether the voicebot answers correctly. In addition,
random text or random voice commands are given to see the bot's responses. Each question has its own
keywords to help the bot determine which question the user is asking. The voicebot replies depending
on the identified keywords.

For example, the keyword of the question "Homestay còn phòng không?" (Is this homestay
available for rental?) is "còn phòng" (available for rental); for the question "Sau khi thuê có
thể đổi phòng hoặc hủy đăng kí không?" (After renting, could I change to another place or cancel renting?),
the keywords are "đổi phòng" (change to another place) and "hủy đăng kí" (cancel renting).
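
A keyword-based lookup of this kind could be sketched as below. The keywords are taken from the two examples above; everything else (the rule table layout, the answer labels, the substring matching strategy and the function name answerFor) is an assumption made for illustration and is not the authors' implementation. A fallback is included for messages that match no rule, in line with the off-topic handling reported in section 3.2.

```javascript
// Illustrative rules only: the keywords come from the two examples in the
// text, while the answer labels and the matching strategy are assumptions.
const rules = [
  { keywords: ["còn phòng"], answer: "availability" },
  { keywords: ["đổi phòng", "hủy đăng kí"], answer: "change-or-cancel" },
];

// Placeholder labels for the off-topic replies.
const fallbacks = ["fallback-1", "fallback-2", "fallback-3"];

// Returns the answer of the first rule whose keyword occurs in the message;
// otherwise returns a randomly chosen fallback. `pickIndex` is injectable so
// the random choice can be fixed in tests.
function answerFor(message, pickIndex = (n) => Math.floor(Math.random() * n)) {
  // Normalize Unicode so composed and decomposed diacritics compare equal.
  const text = message.toLowerCase().normalize("NFC");
  for (const rule of rules) {
    if (rule.keywords.some((k) => text.includes(k.normalize("NFC")))) {
      return rule.answer;
    }
  }
  return fallbacks[pickIndex(fallbacks.length)];
}

console.log(answerFor("Homestay còn phòng không?"));
console.log(answerFor("Sau khi thuê có thể đổi phòng hoặc hủy đăng kí không?"));
console.log(answerFor("Thời tiết hôm nay thế nào?", () => 0));
```

Plain substring matching is simple but brittle: a rephrased question that avoids the exact keyword falls through to the fallback.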




Table 2. Voicebot testing data
Questions

Homestay còn phòng không? (Is this homestay available for rental?)
Homestay bao gồm những gì? (What does this homestay consist of?)
Địa chỉ homestay ở đâu? (What is the address of this homestay?)
Từ homestay mình có thể đi chơi những đâu? (Where can I go for a trip from this homestay?)
Giá thuê homestay như thế nào? (What is the rental price?)
Làm thế nào để đặt chỗ? (What do I need to do to rent this homestay?)
Hãy cung cấp thông tin chuyển khoản. (Please provide banking information.)
Tôi cần cung cấp thông tin cá nhân tại đâu? (Where do I need to provide my personal information?)
Giờ check-in và check-out thế nào? (What are the check-in and check-out times?)
Sau khi thuê có thể đổi phòng hoặc hủy đăng kí không? (After renting, could I change to another place or cancel renting?)











3. RESULTS AND DISCUSSION 
3.1. Module testing 

Test results of the TTS and STT modules are presented in Table 3. In the TTS test, the module had
a mean trial processing time (MTPT) of 6.720 seconds for Vietnamese. In total, it processed 138 Vietnamese
words and achieved 100% accuracy. For English, it had an MTPT of 8.508 seconds. In total,
it processed 139 English words, but the module achieved only approximately 72.66% accuracy. Meanwhile,
the STT module achieved 90.51% accuracy in the Vietnamese audio test (FOS dataset). The accuracies
after testing with the LS dataset and the music files are both 0%.
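
The reported percentages follow directly from the word and file counts in Table 3. As a quick check, the sketch below recomputes them as correct / total × 100; the variable and function names are introduced here for illustration only.

```javascript
// Counts copied from Table 3; accuracy = correct / total * 100.
const results = [
  { module: "TTS", data: "Vietnamese", total: 138, correct: 138 },
  { module: "TTS", data: "English", total: 139, correct: 101 },
  { module: "STT", data: "FOS", total: 432, correct: 391 },
  { module: "STT", data: "LS", total: 395, correct: 0 },
  { module: "STT", data: "Music", total: 2, correct: 0 },
];

function accuracyPercent(correct, total) {
  return (100 * correct) / total;
}

for (const r of results) {
  console.log(`${r.module} ${r.data}: ${accuracyPercent(r.correct, r.total).toFixed(2)}%`);
}
// TTS Vietnamese: 100.00%
// TTS English: 72.66%
// STT FOS: 90.51%
// STT LS: 0.00%
// STT Music: 0.00%
```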

Figure 4 presents an example of a wrong case in English audio conversion. The expected output
was “The place he had was a very good one, the sun shone on him, as to fresh air there was enough of that,
and round him grew many large-sized comrades, pines as well as firs”. It can be seen that FPT.AI's modules
support Vietnamese much better than other languages such as English. The reason is that the modules were
mainly developed for the Vietnamese market. Therefore, text and audio in English and other languages
cannot be processed well, and the outputs cannot be expected to be as accurate as
the Vietnamese ones. It should be noted that the conversion results also depend on other factors, such as
audio noise [30], audio quality [31], audio volume [32], and the tone and accent of the speaker [33].


Table 3. Modules testing result
Module  Data        Quantity   Correct    Mean Trial Proc. Time
TTS     Vietnamese  138 words  138 words  6.720 seconds
        English     139 words  101 words  8.508 seconds
STT     FOS         432 words  391 words  11.422 seconds
        LS          395 words  0 words    23.417 seconds
        Music       2 files    0 files    21.058 seconds





File name: 672-122797-0001.mp3;
Length: 00:00:15;
Analyzed content: Quay xi ga Cuba Real san Anfield ca rau. Thank you may Need More sai hóm ra khói su chü ý Ö sō;
Processing time: 00:00:23.40.





Figure 4. Detailed results of a wrong case of English audio file processing 


From the results, it is observed that the STT module cannot handle music audio sources, regardless
of the language of their content. Figure 5 presents the detailed results after the STT module processed
the two music files. The expected output for the English file is "I'm walking down the line that divides me
somewhere in my mind on the border line of the edge and where I walk alone", and for the Vietnamese file
is “Ngày hôm qua em chợt mang nắng tới, chợt làm ngẩn ngơ đắm say anh chiếc hôn bối rối, chợt làm muôn
câu ca, chợt gọi mùa xuân qua, đừng rời tay trong tay anh nghe mùa xuân ơi đắm say, và...”. The
system's actual results are thus different from the expected outputs. Hence, the modules cannot be
used to process music audio sources.


(a)
File name: Vietnamese_Music.mp3;
Length: 00:00:15;
Analyzed content: Ngày hôm nay. Tròi oi, nhó nam say you do. Gol mua xudan qua duóng dáy deo trén tay anh. Dini
Processing time: 00:00:20.698.

(b)
File name: English_Music.mp3;
Length: 00:00:15;
Analyzed content: Sao. Di. Qua.;
Processing time: 00:00:21.418.



Figure 5. Detailed results after processing the two music files, (a) Vietnamese music file, 
(b) English music file 


3.2. Voicebot testing 

In this test, the voicebot achieved 100% accuracy for the prepared questions. The processing time
of the voicebot was measured from when the user clicked the Stop recording button to when the user received
the corresponding response message. The voicebot took 24.738 seconds in total to convert the user's ten
questions from speech into text for processing, and 60.174 seconds in total to find the ten answers and convert
them from text into audio files to play in the user interface. On average, the trial processing time is
4.246 seconds over the 20 questions and answers. However, if the user asks in a different way, such as asking
shortened questions or questions containing different keywords, the bot will not understand
the intention of the user and will treat the message as off-topic. For off-topic messages, the voicebot
randomly picks one of three messages, "Xin lỗi! Tôi chưa hiểu ý bạn là gì." (I am sorry! I didn't understand
what you meant), "Bạn vui lòng nêu rõ câu hỏi nhé!" (Please specify your question!) and "Xin lỗi bạn! Tôi
chưa hiểu bạn nói gì" (I am sorry! I didn't understand what you said), and replies to the user with it.
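
The mean trial processing time quoted above can be recomputed from the two reported totals. The sketch below does so, and also derives the average time for one complete question-and-answer round trip, a figure not stated in the text but implied by the same totals; the variable names are introduced here for illustration.

```javascript
const sttTotalSeconds = 24.738; // ten questions, speech to text
const ttsTotalSeconds = 60.174; // ten answers, lookup plus text to speech
const trials = 20;              // ten questions + ten answers
const roundTrips = 10;          // one question with its answer

const totalSeconds = sttTotalSeconds + ttsTotalSeconds;
const meanTrialSeconds = totalSeconds / trials;
const meanRoundTripSeconds = totalSeconds / roundTrips;

console.log(meanTrialSeconds.toFixed(3));     // 4.246
console.log(meanRoundTripSeconds.toFixed(3)); // 8.491
```

In other words, under these figures a user waits roughly 8.5 seconds on average between stopping the recording and hearing the answer.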


4. CONCLUSION

In this research, FPT.AI-based STT and TTS modules were utilized to develop and test a voicebot
supporting homestay service queries. From the obtained results, one can conclude that the modules
are only suitable for Vietnamese text and speech processing due to their imprecision in processing other
languages (i.e., with English, the accuracies are 72.66% and 0% for TTS and STT, respectively). For
Vietnamese content, the accuracies of the TTS module and the STT module are 100% and 90.51%,
respectively. The voicebot works well with the prepared questions and statements. However, the bot's
detection accuracy still needs to be improved in the future so that it can recognize more complex questions
and statements.